{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ffff1f1f7edce75e",
   "metadata": {},
   "source": [
    "# Excel File Loading in LangChain\n",
    "\n",
    "- Author: [Hwayoung Cha](https://github.com/forwardyoung)\n",
    "- Peer Review :\n",
    "- Proofread : [Youngjun cho](https://github.com/choincnp)\n",
    "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
    "\n",
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/05-ExcelLoader.ipynb)[![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/05-ExcelLoader.ipynb)\n",
    "## Overview\n",
    "\n",
    "This tutorial covers the process of loading and handling ```Microsoft Excel``` files in ```LangChain``` .\n",
    "\n",
    "It focuses on two primary methods: ```UnstructuredExcelLoader``` for raw text extraction and ```DataFrameLoader``` for structured data processing.\n",
    "\n",
    "The guide aims to help developers effectively integrate Excel data into their ```LangChain``` projects, covering both basic and advanced usage scenarios.\n",
    "\n",
    "### Table of Contents\n",
    "\n",
    "- [Overview](#overview)\n",
    "- [Environment Setup](#environment-setup)\n",
    "- [UnstructuredExcelLoader](#UnstructuredExcelLoader)\n",
    "- [DataFrameLoader](#DataFrameLoader)\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61d6dfb9165a8a5c",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
    "\n",
    "**[Note]**\n",
    "- ```langchain-opentutorial``` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n",
    "- You can checkout the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2e8882060efa1ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture --no-stderr\n",
    "%pip install langchain-opentutorial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea721609cfbfab70",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "from langchain_opentutorial import package\n",
    "\n",
    "package.install(\n",
    "    [\n",
    "        \"langchain_community\",\n",
    "        \"unstructured\",\n",
    "        \"openpyxl\"\n",
    "    ],\n",
    "    verbose=False,\n",
    "    upgrade=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3552429c2252ddc",
   "metadata": {},
   "source": [
    "## ```UnstructuredExcelLoader```\n",
    "\n",
    "```UnstructuredExcelLoader``` is used to load ```Microsoft Excel``` files.\n",
    "\n",
    "This loader works with both ```.xlsx``` and ```.xls``` files.\n",
    "\n",
    "When the loader is used in ```mode=\"elements\"``` , an HTML representation of the Excel file is provided under the ```text_as_html``` key in the document metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49f04d7628772561",
   "metadata": {},
   "outputs": [],
   "source": [
    "# install\n",
    "# %pip install -qU langchain-community unstructured openpyxl"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "edf68cf0efa0ce27",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "from langchain_community.document_loaders import UnstructuredExcelLoader\n",
    "\n",
    "# Set recursion limit\n",
    "sys.setrecursionlimit(10**6)    \n",
    "\n",
    "# Create UnstructuredExcelLoader \n",
    "loader = UnstructuredExcelLoader(\"./data/titanic.xlsx\", mode=\"elements\")\n",
    "\n",
    "# Load a document\n",
    "docs = loader.load()\n",
    "\n",
    "# Print the number of documents\n",
    "print(len(docs))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4953c041db717589",
   "metadata": {},
   "source": [
    "This confirms that one document has been loaded.\n",
    "\n",
    "The ```page_content``` contains the data from each row, while the ```text_as_html``` in the ```metadata``` stores the data in HTML format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "81447e1ba5a5b4a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print the document\n",
    "print(docs[0].page_content[:200])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a9cf4f7c409fd937",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print the text_as_html of metadata\n",
    "print(docs[0].metadata[\"text_as_html\"][:1000])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cce664281d9d57f4",
   "metadata": {},
   "source": [
    "![text_as_html](./assets/05-excel-loader-text-as-html.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ed860d9960d54a7",
   "metadata": {},
   "source": [
    "## ```DataFrameLoader```\n",
    "\n",
    "- Similar to CSV files, we can load Excel files by using the ```read_excel()``` function to create a ```pandas.DataFrame```, and then load it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb49d72b095ede6b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# read the Excel file\n",
    "df = pd.read_excel(\"./data/titanic.xlsx\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8a978b795661c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import DataFrameLoader\n",
    "\n",
    "# Set up DataFrame loader, specifying the page content column\n",
    "loader = DataFrameLoader(df, page_content_column=\"Name\")\n",
    "\n",
    "# Load the document\n",
    "docs = loader.load()\n",
    "\n",
    "# Print the data\n",
    "print(docs[0].page_content)\n",
    "\n",
    "# Print the metadata\n",
    "print(docs[0].metadata)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
